34 - Pattern Recognition [PR] - PR 30

Welcome back to pattern recognition. So today we want to look a bit more into the EM algorithm

and in particular we want to learn how to apply it to other problems, and we will start

with the so-called missing information principle.

Now the missing information principle is simply this: the observable information

is given as the complete information minus the hidden information. Now let's look at this

in a little more mathematical way. So we can formalize the entire problem as the

observable random variable x, the hidden random variable y and the parameter set theta. Now we can

use this to formalize the joint probability density of the events x, the observable part, and y, the hidden part: the probability of x and y with parameters theta can be expressed as the probability of x and theta times the probability of y given x and theta. Now we can reformulate this, and we see that our p of x and theta is given as the fraction of p of x, y and theta divided by p of y given x and theta. So now we can apply the negative logarithm, and we see that minus the logarithm of p of x and theta is equal to minus the logarithm of p of x, y and theta minus the negative logarithm of p of y given x and theta.
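
For reference, the spoken equations can be written compactly as follows; the semicolon notation p(x; theta) for a density parameterized by theta is an assumption about the slide notation, not taken from the audio:

\[
p(x, y; \theta) = p(x; \theta)\, p(y \mid x; \theta)
\quad\Longrightarrow\quad
p(x; \theta) = \frac{p(x, y; \theta)}{p(y \mid x; \theta)},
\]
\[
-\log p(x; \theta) = -\log p(x, y; \theta) - \bigl(-\log p(y \mid x; \theta)\bigr).
\]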

Now let's rewrite this a little bit. The idea that we want to introduce is to write it as an iteration scheme. We can write this essentially in the same form for the i plus 1-th iteration, but with an additional iteration index, and our theta hat is now the estimate of the parameter vector theta, with the index indicating the iteration step. Now let's multiply both sides with p of y given x and the parameter vector of the i-th iteration, and integrate over the entire hidden event y. So we introduce the integral and we multiply with this additional probability.
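
Sketched in formulas, and again assuming the hat and superscript notation for the parameter estimates, this step reads:

\[
-\log p(x; \hat{\theta}^{(i+1)}) = -\log p(x, y; \hat{\theta}^{(i+1)}) - \bigl(-\log p(y \mid x; \hat{\theta}^{(i+1)})\bigr),
\]
and after multiplying both sides by \(p(y \mid x; \hat{\theta}^{(i)})\) and integrating over y:
\[
-\int p(y \mid x; \hat{\theta}^{(i)}) \log p(x; \hat{\theta}^{(i+1)})\, dy
= -\int p(y \mid x; \hat{\theta}^{(i)}) \log p(x, y; \hat{\theta}^{(i+1)})\, dy
+ \int p(y \mid x; \hat{\theta}^{(i)}) \log p(y \mid x; \hat{\theta}^{(i+1)})\, dy.
\]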

Then this can be written as the integral on the left hand side, and of course also on the right hand side we can split the two integrals. We end up with terms of the form: the probability of y given x and the parameter vector theta, times the logarithms from above. Now you already see that this looks kind of familiar. So let's look a little bit more at the left hand side

and here you can see that if we rearrange this a little bit, we can pull out the logarithm of the probability of x and theta at iteration i plus 1. This can then be simplified to just this logarithm, because the integral of p of y given x over the entire domain of y is just going to be 1. So we remain with the logarithm of the probability of x, and we can observe that the left hand side of the equation is the log likelihood function of our observations. We encountered this term previously, and we can now use it to express the log likelihood function.
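
In symbols, the simplification of the left hand side described here is:

\[
-\int p(y \mid x; \hat{\theta}^{(i)}) \log p(x; \hat{\theta}^{(i+1)})\, dy
= -\log p(x; \hat{\theta}^{(i+1)}) \int p(y \mid x; \hat{\theta}^{(i)})\, dy
= -\log p(x; \hat{\theta}^{(i+1)}),
\]
since the conditional density of y given x integrates to one over the domain of y.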

So the maximization of the right hand side of our key equation corresponds to a maximum likelihood estimation. Now if you look at the terms on the right hand side, we see that we can introduce the following notation. Formally this is a little bit incorrect, but it is just a small alteration of the iteration indices, and it gives rise to the Kullback-Leibler statistics: Q of the parameter vector theta hat at iteration i and iteration i plus 1 is given as the integral over y of p of y given x and the parameter vector at iteration i, times the logarithm of p of x and y with the parameter vector at iteration i plus 1. Furthermore, we obtain the entropy, which is here defined as H between our parameter vector theta hat at iteration i and iteration i plus 1: the negative integral over the probability of y given x and the parameter vector at iteration i, times the logarithm of the probability of y given x and the parameter vector of the next iteration.
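
Written out, the two quantities are, as far as they can be reconstructed from the spoken description:

\[
Q(\hat{\theta}^{(i)}, \hat{\theta}^{(i+1)}) = \int p(y \mid x; \hat{\theta}^{(i)}) \log p(x, y; \hat{\theta}^{(i+1)})\, dy,
\]
\[
H(\hat{\theta}^{(i)}, \hat{\theta}^{(i+1)}) = -\int p(y \mid x; \hat{\theta}^{(i)}) \log p(y \mid x; \hat{\theta}^{(i+1)})\, dy.
\]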

Now let's first have a closer look at the Kullback-Leibler statistics. If we look at the statistics between theta and theta prime and write out the expression as in the above equation, then we can see that the Kullback-Leibler statistics with respect to theta and theta prime, which are also called the Q function, are nothing else than the conditional expectation of the log likelihood of the joint probability of x and y.
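
As a formula, this observation reads as follows, where the first argument is assumed to parameterize the conditional density and the second the joint density, matching the iteration notation above:

\[
Q(\theta, \theta') = \int p(y \mid x; \theta) \log p(x, y; \theta')\, dy
= \mathbb{E}_{y \mid x; \theta}\bigl[\log p(x, y; \theta')\bigr].
\]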

Now let's have a look again at the key equation. You can see that we can now write it down as follows: the log likelihood function of x can be expressed as the Q statistics plus the entropy of the old parameter set and the new parameter set.
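
The key equation referred to here is, in the notation introduced above:

\[
\log p(x; \hat{\theta}^{(i+1)}) = Q(\hat{\theta}^{(i)}, \hat{\theta}^{(i+1)}) + H(\hat{\theta}^{(i)}, \hat{\theta}^{(i+1)}).
\]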

So now we want to motivate that the maximization of the Kullback-Leibler statistics can actually replace the optimization of the log likelihood function. Note that we only look at this on a rather coarse level; a complete proof can be found in the literature if you have a look at the further readings. So let's look a bit into the

Part of a video series
Access: Open Access
Duration: 00:20:01 min
Recording date: 2020-11-14
Uploaded: 2020-11-15 00:37:39
Language: en-US

In this video, we analyze the expectation-maximization algorithm and relate it to Kullback-Leibler statistics in the context of the missing information principle.

This video is released under CC BY 4.0. Please feel free to share and reuse.

For reminders to watch new videos, follow us on Twitter or LinkedIn. Also, join our network for information about talks, videos, and job offers in our Facebook and LinkedIn groups.

Music Reference: Damiano Baldoni - Thinking of You
